Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
translated by 谷歌翻译
Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al.,2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.
translated by 谷歌翻译
预先训练的蒙版语言模型通过将下游任务作为文本填充来成功执行几次学习。但是,作为全镜头环境中的强大替代方案,诸如Electra之类的判别预训练模型不适合范式。在这项工作中,我们调整了基于及时的几次学习来进行电信,并表明它在广泛的任务中优于蒙面的语言模型。Electra是预先训练的,以区分令牌是产生还是原始。我们自然地将其扩展到基于迅速的几次学习,通过培训来评分目标选项的原创性,而无需引入新参数。我们的方法很容易适应涉及多token预测的任务,而无需额外的计算开销。分析表明,Electra学习分布与下游任务更好。
translated by 谷歌翻译
专家层(MOES)的混合物通过条件计算实现语言模型的高效缩放。本文提出了一个详细的实证研究,自回归鞋语言模型与广泛的设置中的密集模型相比:在域外语言建模,零和少量射击和全部微调。除了微调外,我们发现Moes基本上更加计算效率。在更适度的培训预算下,MOES可以使用$ \ SIM值4倍的计算,符合密集模型的性能。该差距在比例下变窄,但我们最大的MOE模型(1.1T参数)始终如一地优于计算等效的密集模型(6.7b参数)。总体而言,这种表现差距在任务和域中有很大差异,表明MOE和密集模型以不值得研究的方式概括不同的方式。我们使我们的代码和模型公开可用于研究使用。
translated by 谷歌翻译
We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by ( 1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Timely and effective feedback within surgical training plays a critical role in developing the skills required to perform safe and efficient surgery. Feedback from expert surgeons, while especially valuable in this regard, is challenging to acquire due to their typically busy schedules, and may be subject to biases. Formal assessment procedures like OSATS and GEARS attempt to provide objective measures of skill, but remain time-consuming. With advances in machine learning there is an opportunity for fast and objective automated feedback on technical skills. The SimSurgSkill 2021 challenge (hosted as a sub-challenge of EndoVis at MICCAI 2021) aimed to promote and foster work in this endeavor. Using virtual reality (VR) surgical tasks, competitors were tasked with localizing instruments and predicting surgical skill. Here we summarize the winning approaches and how they performed. Using this publicly available dataset and results as a springboard, future work may enable more efficient training of surgeons with advances in surgical data science. The dataset can be accessed from https://console.cloud.google.com/storage/browser/isi-simsurgskill-2021.
translated by 谷歌翻译
A real-world application or setting involves interaction between different modalities (e.g., video, speech, text). In order to process the multimodal information automatically and use it for an end application, Multimodal Representation Learning (MRL) has emerged as an active area of research in recent times. MRL involves learning reliable and robust representations of information from heterogeneous sources and fusing them. However, in practice, the data acquired from different sources are typically noisy. In some extreme cases, a noise of large magnitude can completely alter the semantics of the data leading to inconsistencies in the parallel multimodal data. In this paper, we propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique. In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality, and subsequently, the contribution from each modality is dynamically varied while estimating the joint distribution. We evaluate our method on two challenging benchmarks from two diverse domains: multimodal 3D hand-pose estimation and multimodal surgical video segmentation. We attain state-of-the-art performance on both benchmarks. Our extensive quantitative and qualitative evaluations show the advantages of our method compared to previous approaches.
translated by 谷歌翻译
在我们不断变化的气候中,使用模型来评估天气和气候对社会和企业的后续后果的风险及其后续后果至关重要。这种模型的操作在历史上是定制的,并限制在特定的计算基础架构,驱动数据集和预定义的配置上。这些约束通过缩放模型运行并将模型掌握在感兴趣的用户手中。在这里,我们提出了一个基于云的模块化框架,用于部署和操作地理空间模型,最初应用于气候影响。气候冲击建模框架(CIMF)可以以动态和灵活的方式部署模块化工作流程。用户可以以简化的方式指定工作流程组件,然后可以轻松地将这些组件组织成不同的配置,以以不同的方式和不同的尺度评估风险。这还可以使不同的模型(物理模拟或机器学习模型)和工作流程连接以产生合并的风险评估。洪水建模被用作端到端的示例,以证明CIMF的操作。
translated by 谷歌翻译
希望以优先的,有序的方式相结合,因为它允许模块化设计并通过知识传输来促进数据重用。在控制理论中,优先的组合物是通过空空间控制实现的,其中低优先级控制动作被投影到高优先级控制动作的空空间中。这种方法目前无法用于加强学习。我们为增强学习提出了一个新颖的,任务优先的组成框架,其中涉及一个新颖的概念:强化学习政策的冷漠空间。我们的框架有可能促进知识转移和模块化设计,同时大大提高数据效率和增强学习代理的数据重用。此外,我们的方法可以确保高优先级的限制满意度,这使得在机器人技术等安全 - 关键领域中学习有望。与零空间的控制不同,我们的方法允许通过在最初的复合策略构建后在高级政策的无差异空间中在线学习来学习复合任务的全球最佳策略。
translated by 谷歌翻译